Predict Bike Sharing Demand with AutoGluon Template¶

Project: Predict Bike Sharing Demand with AutoGluon¶

This notebook is a template with each step that you need to complete for the project.

Please fill in your code where there are explicit ? markers in the notebook. You are welcome to add more cells and code as you see fit.

Once you have completed all the code implementations, please export your notebook as an HTML file so the reviewers can view your code. Make sure all cell outputs are displayed correctly.

File-> Export Notebook As... -> Export Notebook as HTML

There is a writeup to complete as well after all code implementation is done. Please answer all questions and attach the necessary tables and charts. You can complete the writeup in either markdown or PDF.

Completing the code template and writeup template will cover all of the rubric points for this project.

The rubric contains "Stand Out Suggestions" for enhancing the project beyond the minimum requirements. The stand out suggestions are optional. If you decide to pursue the "stand out suggestions", you can include the code in this notebook and also discuss the results in the writeup file.

Step 1: Create an account with Kaggle¶

Create Kaggle Account and download API key¶

Below is an example of the steps to get the API username and key. Each student will have their own username and key.

  1. Open account settings. kaggle1.png kaggle2.png
  2. Scroll down to API and click Create New API Token. kaggle3.png kaggle4.png
  3. Open up kaggle.json and use the username and key. kaggle5.png

Step 2: Download the Kaggle dataset using the kaggle python library¶

Open up SageMaker Studio and use the starter template¶

  1. Notebook should be using an ml.t3.medium instance (2 vCPU + 4 GiB)
  2. Notebook should be using kernel: Python 3 (MXNet 1.8 Python 3.7 CPU Optimized)

Install packages¶

In [ ]:
!pip install -U pip
!pip install -U setuptools wheel
!pip install -U "mxnet<2.0.0" bokeh==2.0.1
!pip install autogluon --no-cache-dir
# Without --no-cache-dir, smaller aws instances may have trouble installing
In [ ]:
!pip install -U python-dotenv
!pip install -U kaggle
In [ ]:
!pip install -U pandas-profiling
!pip install ipywidgets==7.7.2
!pip install pydantic==1.10.2

Setup Kaggle API Key¶

In [2]:
# create the .kaggle directory and an empty kaggle.json file
!mkdir -p /root/.kaggle
!touch /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json
In [3]:
from dotenv import load_dotenv 
from os import environ
load_dotenv()
Out[3]:
True
In [4]:
# Fill in your user name and key from creating the kaggle account and API token file
import json
kaggle_username = environ.get("KAGGLE_USERNAME")
kaggle_key = environ.get("KAGGLE_KEY")

# Save the API token to the kaggle.json file
with open("/root/.kaggle/kaggle.json", "w") as f:
    f.write(json.dumps({"username": kaggle_username, "key": kaggle_key}))

Download and explore dataset¶

Go to the bike sharing demand competition and agree to the terms¶

kaggle6.png

In [5]:
# Download the dataset; it will be in a .zip file, so you'll need to unzip it as well.
#!kaggle competitions download -c bike-sharing-demand
# If you already downloaded it, you can use the -o flag to overwrite the files when unzipping
!unzip -o bike-sharing-demand.zip
Archive:  bike-sharing-demand.zip
  inflating: sampleSubmission.csv    
  inflating: test.csv                
  inflating: train.csv               
In [6]:
import pandas as pd
from autogluon.tabular import TabularPredictor
In [7]:
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
In [8]:
# Create the train dataset in pandas by reading the csv
# Set the parsing of the datetime column so you can use some of the `dt` features in pandas later
train = pd.read_csv("train.csv", parse_dates=["datetime"])
train.head()
Out[8]:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1
In [9]:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   casual      10886 non-null  int64         
 10  registered  10886 non-null  int64         
 11  count       10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB
In [10]:
# Simple summary of the train dataset to view the min/max/variation of its features.
train.describe()
Out[10]:
season holiday workingday weather temp atemp humidity windspeed casual registered count
count 10886.000000 10886.000000 10886.000000 10886.000000 10886.00000 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000
mean 2.506614 0.028569 0.680875 1.418427 20.23086 23.655084 61.886460 12.799395 36.021955 155.552177 191.574132
std 1.116174 0.166599 0.466159 0.633839 7.79159 8.474601 19.245033 8.164537 49.960477 151.039033 181.144454
min 1.000000 0.000000 0.000000 1.000000 0.82000 0.760000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 2.000000 0.000000 0.000000 1.000000 13.94000 16.665000 47.000000 7.001500 4.000000 36.000000 42.000000
50% 3.000000 0.000000 1.000000 1.000000 20.50000 24.240000 62.000000 12.998000 17.000000 118.000000 145.000000
75% 4.000000 0.000000 1.000000 2.000000 26.24000 31.060000 77.000000 16.997900 49.000000 222.000000 284.000000
max 4.000000 1.000000 1.000000 4.000000 41.00000 45.455000 100.000000 56.996900 367.000000 886.000000 977.000000
In [11]:
# Create the test dataframe in pandas by reading the csv; remember to parse the datetime!
test = pd.read_csv("test.csv", parse_dates=["datetime"])
test.head()
Out[11]:
datetime season holiday workingday weather temp atemp humidity windspeed
0 2011-01-20 00:00:00 1 0 1 1 10.66 11.365 56 26.0027
1 2011-01-20 01:00:00 1 0 1 1 10.66 13.635 56 0.0000
2 2011-01-20 02:00:00 1 0 1 1 10.66 13.635 56 0.0000
3 2011-01-20 03:00:00 1 0 1 1 10.66 12.880 56 11.0014
4 2011-01-20 04:00:00 1 0 1 1 10.66 12.880 56 11.0014
In [12]:
# Same as with the train and test datasets
submission = pd.read_csv("sampleSubmission.csv", parse_dates=["datetime"])
submission.head()
Out[12]:
datetime count
0 2011-01-20 00:00:00 0
1 2011-01-20 01:00:00 0
2 2011-01-20 02:00:00 0
3 2011-01-20 03:00:00 0
4 2011-01-20 04:00:00 0

Step 3: Train a model using AutoGluon’s Tabular Prediction¶

Requirements:

  • We are predicting count, so it is the label we are setting.
  • Ignore the casual and registered columns, since they are not present in the test dataset.
  • Use the root_mean_squared_error as the metric to use for evaluation.
  • Set a time limit of 10 minutes (600 seconds).
  • Use the preset best_quality to focus on creating the best model.
In [13]:
learner_kwargs = {
    "ignored_columns": ["casual", "registered"]
}

predictor = TabularPredictor(label="count", learner_kwargs=learner_kwargs, problem_type="regression", 
    eval_metric="root_mean_squared_error").fit(train_data=train, time_limit=600, presets="best_quality")
No path specified. Models will be saved in: "AutogluonModels/ag-20221230_044222/"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "AutogluonModels/ag-20221230_044222/"
AutoGluon Version:  0.6.1
Python Version:     3.7.10
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Wed Oct 26 20:36:53 UTC 2022
Train Data Rows:    10886
Train Data Columns: 11
Label Column: count
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Dropping user-specified ignored columns: ['casual', 'registered']
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    3054.59 MB
	Train Data (Original)  Memory Usage: 0.78 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 2 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting DatetimeFeatureGenerator...
/usr/local/lib/python3.7/site-packages/autogluon/features/generators/datetime.py:59: FutureWarning: casting datetime64[ns, UTC] values to int64 with .astype(...) is deprecated and will raise in a future version. Use .view(...) instead.
  good_rows = series[~series.isin(bad_rows)].astype(np.int64)
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('datetime', []) : 1 | ['datetime']
		('float', [])    : 3 | ['temp', 'atemp', 'windspeed']
		('int', [])      : 5 | ['season', 'holiday', 'workingday', 'weather', 'humidity']
	Types of features in processed data (raw dtype, special dtypes):
		('float', [])                : 3 | ['temp', 'atemp', 'windspeed']
		('int', [])                  : 3 | ['season', 'weather', 'humidity']
		('int', ['bool'])            : 2 | ['holiday', 'workingday']
		('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
	0.5s = Fit runtime
	9 features in original data used to generate 13 features in processed data.
	Train Data (Processed) Memory Usage: 0.98 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.62s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
	This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
	To change this, specify the eval_metric parameter of Predictor()
AutoGluon will fit 2 stack levels (L1 to L2) ...
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif_BAG_L1 ... Training model for up to 399.49s of the 599.38s of remaining time.
	-101.5462	 = Validation score   (-root_mean_squared_error)
	0.03s	 = Training   runtime
	0.1s	 = Validation runtime
Fitting model: KNeighborsDist_BAG_L1 ... Training model for up to 397.29s of the 597.19s of remaining time.
	-84.1251	 = Validation score   (-root_mean_squared_error)
	0.03s	 = Training   runtime
	0.1s	 = Validation runtime
Fitting model: LightGBMXT_BAG_L1 ... Training model for up to 396.94s of the 596.83s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-131.4609	 = Validation score   (-root_mean_squared_error)
	65.58s	 = Training   runtime
	6.75s	 = Validation runtime
Fitting model: LightGBM_BAG_L1 ... Training model for up to 320.06s of the 519.95s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-131.0542	 = Validation score   (-root_mean_squared_error)
	31.54s	 = Training   runtime
	1.4s	 = Validation runtime
Fitting model: RandomForestMSE_BAG_L1 ... Training model for up to 283.73s of the 483.62s of remaining time.
	-116.5443	 = Validation score   (-root_mean_squared_error)
	11.12s	 = Training   runtime
	0.55s	 = Validation runtime
Fitting model: CatBoost_BAG_L1 ... Training model for up to 269.42s of the 469.31s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-130.5252	 = Validation score   (-root_mean_squared_error)
	200.14s	 = Training   runtime
	0.08s	 = Validation runtime
Fitting model: ExtraTreesMSE_BAG_L1 ... Training model for up to 64.9s of the 264.79s of remaining time.
	-124.5881	 = Validation score   (-root_mean_squared_error)
	6.54s	 = Training   runtime
	0.68s	 = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L1 ... Training model for up to 54.91s of the 254.8s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-138.9149	 = Validation score   (-root_mean_squared_error)
	74.59s	 = Training   runtime
	0.77s	 = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 173.12s of remaining time.
	-84.1251	 = Validation score   (-root_mean_squared_error)
	0.73s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting 9 L2 models ...
Fitting model: LightGBMXT_BAG_L2 ... Training model for up to 172.3s of the 172.27s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-60.4113	 = Validation score   (-root_mean_squared_error)
	54.59s	 = Training   runtime
	3.12s	 = Validation runtime
Fitting model: LightGBM_BAG_L2 ... Training model for up to 112.97s of the 112.95s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-55.0656	 = Validation score   (-root_mean_squared_error)
	25.47s	 = Training   runtime
	0.3s	 = Validation runtime
Fitting model: RandomForestMSE_BAG_L2 ... Training model for up to 83.4s of the 83.37s of remaining time.
	-53.42	 = Validation score   (-root_mean_squared_error)
	26.83s	 = Training   runtime
	0.62s	 = Validation runtime
Fitting model: CatBoost_BAG_L2 ... Training model for up to 53.51s of the 53.48s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-55.7491	 = Validation score   (-root_mean_squared_error)
	57.02s	 = Training   runtime
	0.07s	 = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L3 ... Training model for up to 360.0s of the -7.83s of remaining time.
	-53.1144	 = Validation score   (-root_mean_squared_error)
	0.28s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 608.3s ... Best model: "WeightedEnsemble_L3"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20221230_044222/")

Review AutoGluon's training run with a ranking of the models that performed best.¶

In [14]:
predictor.fit_summary()
*** Summary of fit() ***
Estimated performance of each model:
                     model   score_val  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0      WeightedEnsemble_L3  -53.114374      14.542919  553.766336                0.000817           0.278653            3       True         14
1   RandomForestMSE_BAG_L2  -53.420043      11.062850  416.399112                0.618036          26.825663            2       True         12
2          LightGBM_BAG_L2  -55.065570      10.741559  415.044961                0.296744          25.471513            2       True         11
3          CatBoost_BAG_L2  -55.749057      10.512298  446.595644                0.067484          57.022196            2       True         13
4        LightGBMXT_BAG_L2  -60.411326      13.559839  444.168311                3.115024          54.594863            2       True         10
5    KNeighborsDist_BAG_L1  -84.125061       0.103688    0.029149                0.103688           0.029149            1       True          2
6      WeightedEnsemble_L2  -84.125061       0.104831    0.762941                0.001143           0.733792            2       True          9
7    KNeighborsUnif_BAG_L1 -101.546199       0.104609    0.032093                0.104609           0.032093            1       True          1
8   RandomForestMSE_BAG_L1 -116.544294       0.552854   11.122160                0.552854          11.122160            1       True          5
9     ExtraTreesMSE_BAG_L1 -124.588053       0.682034    6.536114                0.682034           6.536114            1       True          7
10         CatBoost_BAG_L1 -130.525167       0.080831  200.143895                0.080831         200.143895            1       True          6
11         LightGBM_BAG_L1 -131.054162       1.400994   31.543081                1.400994          31.543081            1       True          4
12       LightGBMXT_BAG_L1 -131.460909       6.754477   65.577447                6.754477          65.577447            1       True          3
13  NeuralNetFastAI_BAG_L1 -138.914862       0.765329   74.589510                0.765329          74.589510            1       True          8
Number of models trained: 14
Types of models trained:
{'WeightedEnsembleModel', 'StackerEnsembleModel_LGB', 'StackerEnsembleModel_CatBoost', 'StackerEnsembleModel_NNFastAiTabular', 'StackerEnsembleModel_XT', 'StackerEnsembleModel_KNN', 'StackerEnsembleModel_RF'}
Bagging used: True  (with 8 folds)
Multi-layer stack-ensembling used: True  (with 3 levels)
Feature Metadata (Processed):
(raw dtype, special dtypes):
('float', [])                : 3 | ['temp', 'atemp', 'windspeed']
('int', [])                  : 3 | ['season', 'weather', 'humidity']
('int', ['bool'])            : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
Plot summary of models saved to file: AutogluonModels/ag-20221230_044222/SummaryOfModels.html
*** End of fit() summary ***
Out[14]:
{'model_types': {'KNeighborsUnif_BAG_L1': 'StackerEnsembleModel_KNN',
  'KNeighborsDist_BAG_L1': 'StackerEnsembleModel_KNN',
  'LightGBMXT_BAG_L1': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L1': 'StackerEnsembleModel_LGB',
  'RandomForestMSE_BAG_L1': 'StackerEnsembleModel_RF',
  'CatBoost_BAG_L1': 'StackerEnsembleModel_CatBoost',
  'ExtraTreesMSE_BAG_L1': 'StackerEnsembleModel_XT',
  'NeuralNetFastAI_BAG_L1': 'StackerEnsembleModel_NNFastAiTabular',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel',
  'LightGBMXT_BAG_L2': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L2': 'StackerEnsembleModel_LGB',
  'RandomForestMSE_BAG_L2': 'StackerEnsembleModel_RF',
  'CatBoost_BAG_L2': 'StackerEnsembleModel_CatBoost',
  'WeightedEnsemble_L3': 'WeightedEnsembleModel'},
 'model_performance': {'KNeighborsUnif_BAG_L1': -101.54619908446061,
  'KNeighborsDist_BAG_L1': -84.12506123181602,
  'LightGBMXT_BAG_L1': -131.46090891834504,
  'LightGBM_BAG_L1': -131.054161598899,
  'RandomForestMSE_BAG_L1': -116.54429428704391,
  'CatBoost_BAG_L1': -130.52516708977194,
  'ExtraTreesMSE_BAG_L1': -124.58805258915959,
  'NeuralNetFastAI_BAG_L1': -138.9148618317948,
  'WeightedEnsemble_L2': -84.12506123181602,
  'LightGBMXT_BAG_L2': -60.41132611426569,
  'LightGBM_BAG_L2': -55.06556954800326,
  'RandomForestMSE_BAG_L2': -53.42004335942844,
  'CatBoost_BAG_L2': -55.74905694074817,
  'WeightedEnsemble_L3': -53.11437398485209},
 'model_best': 'WeightedEnsemble_L3',
 'model_paths': {'KNeighborsUnif_BAG_L1': 'AutogluonModels/ag-20221230_044222/models/KNeighborsUnif_BAG_L1/',
  'KNeighborsDist_BAG_L1': 'AutogluonModels/ag-20221230_044222/models/KNeighborsDist_BAG_L1/',
  'LightGBMXT_BAG_L1': 'AutogluonModels/ag-20221230_044222/models/LightGBMXT_BAG_L1/',
  'LightGBM_BAG_L1': 'AutogluonModels/ag-20221230_044222/models/LightGBM_BAG_L1/',
  'RandomForestMSE_BAG_L1': 'AutogluonModels/ag-20221230_044222/models/RandomForestMSE_BAG_L1/',
  'CatBoost_BAG_L1': 'AutogluonModels/ag-20221230_044222/models/CatBoost_BAG_L1/',
  'ExtraTreesMSE_BAG_L1': 'AutogluonModels/ag-20221230_044222/models/ExtraTreesMSE_BAG_L1/',
  'NeuralNetFastAI_BAG_L1': 'AutogluonModels/ag-20221230_044222/models/NeuralNetFastAI_BAG_L1/',
  'WeightedEnsemble_L2': 'AutogluonModels/ag-20221230_044222/models/WeightedEnsemble_L2/',
  'LightGBMXT_BAG_L2': 'AutogluonModels/ag-20221230_044222/models/LightGBMXT_BAG_L2/',
  'LightGBM_BAG_L2': 'AutogluonModels/ag-20221230_044222/models/LightGBM_BAG_L2/',
  'RandomForestMSE_BAG_L2': 'AutogluonModels/ag-20221230_044222/models/RandomForestMSE_BAG_L2/',
  'CatBoost_BAG_L2': 'AutogluonModels/ag-20221230_044222/models/CatBoost_BAG_L2/',
  'WeightedEnsemble_L3': 'AutogluonModels/ag-20221230_044222/models/WeightedEnsemble_L3/'},
 'model_fit_times': {'KNeighborsUnif_BAG_L1': 0.03209257125854492,
  'KNeighborsDist_BAG_L1': 0.02914905548095703,
  'LightGBMXT_BAG_L1': 65.57744669914246,
  'LightGBM_BAG_L1': 31.543081283569336,
  'RandomForestMSE_BAG_L1': 11.122159719467163,
  'CatBoost_BAG_L1': 200.1438946723938,
  'ExtraTreesMSE_BAG_L1': 6.53611421585083,
  'NeuralNetFastAI_BAG_L1': 74.58950996398926,
  'WeightedEnsemble_L2': 0.7337920665740967,
  'LightGBMXT_BAG_L2': 54.594863176345825,
  'LightGBM_BAG_L2': 25.471513032913208,
  'RandomForestMSE_BAG_L2': 26.825663328170776,
  'CatBoost_BAG_L2': 57.02219581604004,
  'WeightedEnsemble_L3': 0.2786529064178467},
 'model_pred_times': {'KNeighborsUnif_BAG_L1': 0.10460901260375977,
  'KNeighborsDist_BAG_L1': 0.10368776321411133,
  'LightGBMXT_BAG_L1': 6.754477024078369,
  'LightGBM_BAG_L1': 1.400993824005127,
  'RandomForestMSE_BAG_L1': 0.5528538227081299,
  'CatBoost_BAG_L1': 0.08083105087280273,
  'ExtraTreesMSE_BAG_L1': 0.6820335388183594,
  'NeuralNetFastAI_BAG_L1': 0.7653286457061768,
  'WeightedEnsemble_L2': 0.0011434555053710938,
  'LightGBMXT_BAG_L2': 3.1150238513946533,
  'LightGBM_BAG_L2': 0.29674410820007324,
  'RandomForestMSE_BAG_L2': 0.6180357933044434,
  'CatBoost_BAG_L2': 0.06748366355895996,
  'WeightedEnsemble_L3': 0.0008172988891601562},
 'num_bag_folds': 8,
 'max_stack_level': 3,
 'model_hyperparams': {'KNeighborsUnif_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'KNeighborsDist_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'LightGBMXT_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'RandomForestMSE_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'CatBoost_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'ExtraTreesMSE_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'NeuralNetFastAI_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'WeightedEnsemble_L2': {'use_orig_features': False,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBMXT_BAG_L2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'RandomForestMSE_BAG_L2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'CatBoost_BAG_L2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'WeightedEnsemble_L3': {'use_orig_features': False,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True}},
 'leaderboard':                      model   score_val  pred_time_val    fit_time  \
 0      WeightedEnsemble_L3  -53.114374      14.542919  553.766336   
 1   RandomForestMSE_BAG_L2  -53.420043      11.062850  416.399112   
 2          LightGBM_BAG_L2  -55.065570      10.741559  415.044961   
 3          CatBoost_BAG_L2  -55.749057      10.512298  446.595644   
 4        LightGBMXT_BAG_L2  -60.411326      13.559839  444.168311   
 5    KNeighborsDist_BAG_L1  -84.125061       0.103688    0.029149   
 6      WeightedEnsemble_L2  -84.125061       0.104831    0.762941   
 7    KNeighborsUnif_BAG_L1 -101.546199       0.104609    0.032093   
 8   RandomForestMSE_BAG_L1 -116.544294       0.552854   11.122160   
 9     ExtraTreesMSE_BAG_L1 -124.588053       0.682034    6.536114   
 10         CatBoost_BAG_L1 -130.525167       0.080831  200.143895   
 11         LightGBM_BAG_L1 -131.054162       1.400994   31.543081   
 12       LightGBMXT_BAG_L1 -131.460909       6.754477   65.577447   
 13  NeuralNetFastAI_BAG_L1 -138.914862       0.765329   74.589510   
 
     pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  \
 0                 0.000817           0.278653            3       True   
 1                 0.618036          26.825663            2       True   
 2                 0.296744          25.471513            2       True   
 3                 0.067484          57.022196            2       True   
 4                 3.115024          54.594863            2       True   
 5                 0.103688           0.029149            1       True   
 6                 0.001143           0.733792            2       True   
 7                 0.104609           0.032093            1       True   
 8                 0.552854          11.122160            1       True   
 9                 0.682034           6.536114            1       True   
 10                0.080831         200.143895            1       True   
 11                1.400994          31.543081            1       True   
 12                6.754477          65.577447            1       True   
 13                0.765329          74.589510            1       True   
 
     fit_order  
 0          14  
 1          12  
 2          11  
 3          13  
 4          10  
 5           2  
 6           9  
 7           1  
 8           5  
 9           7  
 10          6  
 11          4  
 12          3  
 13          8  }
In [15]:
predictor.leaderboard(silent=True).plot(kind="bar", x="model", y="score_val")
Out[15]:
<AxesSubplot:xlabel='model'>
In [16]:
leaderboard = predictor.leaderboard(silent=True)
leaderboard["description"] = "001 basic features"
leaderboard.to_csv("leaderboard.csv", index=False)
leaderboard.head()
Out[16]:
model score_val pred_time_val fit_time pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order description
0 WeightedEnsemble_L3 -53.114374 14.542919 553.766336 0.000817 0.278653 3 True 14 001 basic features
1 RandomForestMSE_BAG_L2 -53.420043 11.062850 416.399112 0.618036 26.825663 2 True 12 001 basic features
2 LightGBM_BAG_L2 -55.065570 10.741559 415.044961 0.296744 25.471513 2 True 11 001 basic features
3 CatBoost_BAG_L2 -55.749057 10.512298 446.595644 0.067484 57.022196 2 True 13 001 basic features
4 LightGBMXT_BAG_L2 -60.411326 13.559839 444.168311 3.115024 54.594863 2 True 10 001 basic features

Create predictions from test dataset¶

In [17]:
predictions = predictor.predict(test)
predictions.head()
Out[17]:
0    23.152916
1    41.841251
2    45.808411
3    49.782307
4    52.052742
Name: count, dtype: float32

NOTE: Kaggle will reject the submission if any predicted count is negative, so make sure everything is >= 0.¶

In [18]:
# Describe the `predictions` series to see if there are any negative values
predictions.describe()
Out[18]:
count    6493.000000
mean      100.730713
std        89.761986
min         3.153492
25%        19.992170
50%        64.159775
75%       167.717422
max       365.000427
Name: count, dtype: float64
In [19]:
# How many negative values do we have?
print((predictions < 0).sum())
0
In [ ]:
# Set any negative predictions to zero
predictions[predictions < 0] = 0

Set predictions to submission dataframe, save, and submit¶

In [20]:
submission["count"] = predictions
submission.to_csv("submission.csv", index=False)
In [21]:
!kaggle competitions submit -c bike-sharing-demand -f submission.csv -m "first raw submission"
100%|█████████████████████████████████████████| 188k/188k [00:00<00:00, 376kB/s]
Successfully submitted to Bike Sharing Demand

View the submission via the command line, or in the web browser under the competition's page - My Submissions¶

In [22]:
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6
fileName                     date                 description                        status    publicScore  privateScore  
---------------------------  -------------------  ---------------------------------  --------  -----------  ------------  
submission.csv               2022-12-30 04:53:09  first raw submission               complete  1.79188      1.79188       
submission_hpo.csv           2022-12-30 04:34:39  new features and hpo               complete  0.62542      0.62542       
submission_new_features.csv  2022-12-30 04:16:09  model with new features            complete  0.60781      0.60781       
submission_new_hpo.csv       2022-12-30 03:44:56  new features and hpo               complete  0.48505      0.48505       
tail: write error: Broken pipe

Initial score of 1.79188¶

In [23]:
#Score: 1.79188

Step 4: Exploratory Data Analysis and Creating an additional feature¶

  • Any additional feature will do, but a great suggestion would be to separate out the datetime into hour, day, or month parts.
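The suggestion above can be sketched with the pandas `dt` accessor; this is a minimal illustration on a tiny stand-in frame (the sample dates are hypothetical), and the same pattern applies to the real train and test dataframes:

```python
import pandas as pd

# Small illustrative frame standing in for the train/test data
df = pd.DataFrame({"datetime": pd.to_datetime(
    ["2011-01-01 00:00:00", "2011-06-15 13:00:00"])})

# Decompose the datetime into separate parts with the `dt` accessor
df["hour"] = df["datetime"].dt.hour
df["day"] = df["datetime"].dt.day
df["month"] = df["datetime"].dt.month
df["dayofweek"] = df["datetime"].dt.dayofweek  # Monday=0 ... Sunday=6
```

Each new column is an integer feature AutoGluon can use directly, which lets the models pick up daily and weekly demand cycles.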
In [24]:
# Create a histogram of each feature to show its distribution across the data. This is part of the exploratory data analysis
train.hist(figsize=(12, 10))
plt.show()
In [25]:
# Create a new feature: the hour of day extracted from the datetime column
train["hour"] = train["datetime"].dt.hour
test["hour"] = test["datetime"].dt.hour
train.head()
Out[25]:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count hour
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16 0
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40 1
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32 2
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13 3
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1 4
  • Additional features are explored:
In [26]:
# Profiler report
profile = ProfileReport(train)
profile.to_notebook_iframe()
Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]
Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]
Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]
In [27]:
# Visualizations
# Distribution of hourly bike demand by time features
train.groupby([train["datetime"].dt.month, "workingday"])["count"].median().unstack().plot(
    kind='bar', title="Median of hourly bike demand by month (train data)")
train.groupby([train["datetime"].dt.hour, "workingday"])["count"].median().unstack().plot(
    kind='bar', title="Median of hourly bike demand by hour (train data)")
train.groupby([train["datetime"].dt.dayofweek, "workingday"])["count"].median().unstack().plot(
    kind='bar', title="Median of hourly bike demand by dayofweek (train data)")
plt.show()
In [28]:
train.groupby(["holiday"])["count"].median().plot(
    kind='bar', title="Median of hourly bike demand by holiday (train data)")
plt.show()
In [29]:
# Distribution of hourly bike demand by weather features
train.groupby(["season", "workingday"])["count"].median().unstack().plot(
    kind='bar', title="Median of hourly bike demand by season (train data)")
train.groupby(["weather", "workingday"])["count"].median().unstack().plot(
    kind='bar', title="Median of hourly bike demand by weather (train data)")
train.groupby(["temp", "workingday"])["count"].median().unstack().plot(
    kind='bar', title="Median of hourly bike demand by temp (train data)")
train.groupby(["atemp", "workingday"])["count"].median().unstack().plot(
    kind='bar', title="Median of hourly bike demand by atemp (train data)")
train.groupby(["windspeed", "workingday"])["count"].median().unstack().plot(
    kind='bar', title="Median of hourly bike demand by windspeed (train data)")
train.groupby(["humidity", "workingday"])["count"].median().unstack().plot(
    kind='bar', title="Median of hourly bike demand by humidity (train data)")
plt.show()
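The plots above all follow the same pattern: group by two keys, take the median, and `unstack` the second key into columns so each group gets one bar per level. A minimal sketch on toy data (values are illustrative, not project data):

```python
import pandas as pd

# Toy data mirroring the (hour, workingday) grouping used above
df = pd.DataFrame({
    "hour":       [0, 0, 8, 8],
    "workingday": [0, 1, 0, 1],
    "count":      [5, 2, 10, 40],
})

# groupby two keys, then unstack the second key into columns:
# rows = hours, columns = workingday levels 0/1
table = df.groupby(["hour", "workingday"])["count"].median().unstack()
print(table.shape)  # (2, 2)
```

Calling `.plot(kind='bar')` on `table` then draws one bar group per row with a bar per column, which is exactly the layout in the charts above.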
  • Additional features are generated:
In [30]:
# Distribution of events by time features
train["season"].value_counts().plot(
    kind='bar', title="Number of events by season (train data)")
plt.show()
train["weather"].value_counts().plot(
    kind='bar', title="Number of events by weather (train data)")
plt.show()
train["holiday"].value_counts().plot(
    kind='bar', title="Number of events by holiday (train data)")
plt.show()
train["workingday"].value_counts().plot(
    kind='bar', title="Number of events by workingday (train data)")
plt.show()
In [31]:
# Functions for generating new feature values

def get_daytime(hour):
    if 7 <= hour <= 9:
        return "morning"
    elif 12 <= hour <= 15:
        return "lunch"
    elif 16 <= hour <= 19:
        return "rush_hour"
    elif 20 <= hour <= 23:
        return "night"
    else:
        return "other"

def get_tempcat(temp):
    if temp >= 35:
        return "very hot"
    elif temp >= 25:
        return "hot"
    elif temp >= 15:
        return "warm"
    elif temp >= 10:
        return "cool"
    else:
        return "cold"

def get_windcat(windspeed):
    if windspeed > 20:
        return "windy"
    elif windspeed > 10:
        return "mild"
    else:
        return "low"

def get_humiditycat(humidity):
    if humidity >= 80:
        return "high"
    elif humidity > 40:
        return "mild"
    else:
        return "low"
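These threshold functions could also be expressed with `pd.cut`, which bins a whole Series at once instead of calling a Python function per row. A hedged alternative sketch (bin edges chosen to match `get_tempcat`; this is not part of the original notebook):

```python
import numpy as np
import pandas as pd

temps = pd.Series([5.0, 12.0, 20.0, 30.0, 38.0])

# Bin edges mirror get_tempcat: [-inf, 10) cold, [10, 15) cool,
# [15, 25) warm, [25, 35) hot, [35, inf) very hot
tempcat = pd.cut(
    temps,
    bins=[-np.inf, 10, 15, 25, 35, np.inf],
    labels=["cold", "cool", "warm", "hot", "very hot"],
    right=False,  # left-closed bins, so temp == 35 lands in "very hot"
)
print(tempcat.tolist())  # ['cold', 'cool', 'warm', 'hot', 'very hot']
```

A convenient side effect is that `pd.cut` already returns a `category` dtype, so the explicit `astype("category")` step later is not needed for columns built this way.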
In [32]:
# New features are generated

train["daytime"] = train['hour'].apply(get_daytime)
test['daytime'] = test['hour'].apply(get_daytime)
train['atempcat'] = train['atemp'].apply(get_tempcat)
test['atempcat'] = test['atemp'].apply(get_tempcat)
train['windcat'] = train['windspeed'].apply(get_windcat)
test['windcat'] = test['windspeed'].apply(get_windcat)
train['humiditycat'] = train['humidity'].apply(get_humiditycat)
test['humiditycat'] = test['humidity'].apply(get_humiditycat)
In [33]:
train.head()
Out[33]:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count hour daytime atempcat windcat humiditycat
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16 0 other cool low high
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40 1 other cool low high
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32 2 other cool low high
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13 3 other cool low mild
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1 4 other cool low mild
In [34]:
train["daytime"].value_counts().plot(
    kind='bar', title="Number of events by daytime (train data)")
plt.show()
train["atempcat"].value_counts().plot(
    kind='bar', title="Number of events by atempcat (train data)")
plt.show()
train["windcat"].value_counts().plot(
    kind='bar', title="Number of events by windcat (train data)")
plt.show()
train["humiditycat"].value_counts().plot(
    kind='bar', title="Number of events by humiditycat (train data)")
plt.show()

Make category types for these so models know they are not just numbers¶

  • AutoGluon originally sees these as ints, but in reality they are int representations of a category.
  • Setting the dtype to category will classify these as categories in AutoGluon.
In [35]:
category_list = ["season", "weather", "holiday", "workingday"]
train[category_list] = train[category_list].astype("category")
test[category_list] = test[category_list].astype("category")
  • Types for the new features are set:
In [36]:
new_category_list = ["daytime", "atempcat", "windcat", "humiditycat"]
train[new_category_list] = train[new_category_list].astype("category")
test[new_category_list] = test[new_category_list].astype("category")
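The effect of the conversion can be sketched on a toy column (illustrative data only): the integer codes are kept, but the dtype changes so downstream tools treat them as nominal categories rather than numbers.

```python
import pandas as pd

df = pd.DataFrame({"season": [1, 2, 3, 4, 1]})

# Before conversion pandas infers a plain integer dtype
assert df["season"].dtype == "int64"

# astype("category") keeps the values but marks them as nominal,
# so AutoGluon will treat them as categories rather than numbers
df["season"] = df["season"].astype("category")
print(df["season"].dtype)  # category
```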
In [37]:
# View the new feature
train.info()
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   datetime     10886 non-null  datetime64[ns]
 1   season       10886 non-null  category      
 2   holiday      10886 non-null  category      
 3   workingday   10886 non-null  category      
 4   weather      10886 non-null  category      
 5   temp         10886 non-null  float64       
 6   atemp        10886 non-null  float64       
 7   humidity     10886 non-null  int64         
 8   windspeed    10886 non-null  float64       
 9   casual       10886 non-null  int64         
 10  registered   10886 non-null  int64         
 11  count        10886 non-null  int64         
 12  hour         10886 non-null  int64         
 13  daytime      10886 non-null  category      
 14  atempcat     10886 non-null  category      
 15  windcat      10886 non-null  category      
 16  humiditycat  10886 non-null  category      
dtypes: category(8), datetime64[ns](1), float64(3), int64(5)
memory usage: 851.9 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   datetime     6493 non-null   datetime64[ns]
 1   season       6493 non-null   category      
 2   holiday      6493 non-null   category      
 3   workingday   6493 non-null   category      
 4   weather      6493 non-null   category      
 5   temp         6493 non-null   float64       
 6   atemp        6493 non-null   float64       
 7   humidity     6493 non-null   int64         
 8   windspeed    6493 non-null   float64       
 9   hour         6493 non-null   int64         
 10  daytime      6493 non-null   category      
 11  atempcat     6493 non-null   category      
 12  windcat      6493 non-null   category      
 13  humiditycat  6493 non-null   category      
dtypes: category(8), datetime64[ns](1), float64(3), int64(2)
memory usage: 356.5 KB
In [38]:
# View histogram of all features again now with the hour feature
train.hist(figsize=(10, 8))
plt.show()
In [39]:
train.head()
Out[39]:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count hour daytime atempcat windcat humiditycat
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16 0 other cool low high
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40 1 other cool low high
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32 2 other cool low high
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13 3 other cool low mild
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1 4 other cool low mild

Step 5: Rerun the model with the same settings as before, just with more features¶

In [40]:
# Fit model
learner_kwargs = {
    "ignored_columns": ["casual", "registered"]
}

predictor_new_features = TabularPredictor(label="count", learner_kwargs=learner_kwargs, problem_type="regression", 
    eval_metric="root_mean_squared_error").fit(train_data=train, time_limit=600, presets="best_quality")
No path specified. Models will be saved in: "AutogluonModels/ag-20221230_045712/"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "AutogluonModels/ag-20221230_045712/"
AutoGluon Version:  0.6.1
Python Version:     3.7.10
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Wed Oct 26 20:36:53 UTC 2022
Train Data Rows:    10886
Train Data Columns: 16
Label Column: count
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Dropping user-specified ignored columns: ['casual', 'registered']
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    1772.55 MB
	Train Data (Original)  Memory Usage: 0.61 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 2 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
		Fitting DatetimeFeatureGenerator...
/usr/local/lib/python3.7/site-packages/autogluon/features/generators/datetime.py:59: FutureWarning: casting datetime64[ns, UTC] values to int64 with .astype(...) is deprecated and will raise in a future version. Use .view(...) instead.
  good_rows = series[~series.isin(bad_rows)].astype(np.int64)
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('category', []) : 8 | ['season', 'holiday', 'workingday', 'weather', 'daytime', ...]
		('datetime', []) : 1 | ['datetime']
		('float', [])    : 3 | ['temp', 'atemp', 'windspeed']
		('int', [])      : 2 | ['humidity', 'hour']
	Types of features in processed data (raw dtype, special dtypes):
		('category', [])             : 6 | ['season', 'weather', 'daytime', 'atempcat', 'windcat', ...]
		('float', [])                : 3 | ['temp', 'atemp', 'windspeed']
		('int', [])                  : 2 | ['humidity', 'hour']
		('int', ['bool'])            : 2 | ['holiday', 'workingday']
		('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
	0.4s = Fit runtime
	14 features in original data used to generate 18 features in processed data.
	Train Data (Processed) Memory Usage: 0.96 MB (0.1% of available memory)
Data preprocessing and feature engineering runtime = 0.47s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
	This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
	To change this, specify the eval_metric parameter of Predictor()
AutoGluon will fit 2 stack levels (L1 to L2) ...
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif_BAG_L1 ... Training model for up to 399.59s of the 599.53s of remaining time.
	-101.5462	 = Validation score   (-root_mean_squared_error)
	0.04s	 = Training   runtime
	0.1s	 = Validation runtime
Fitting model: KNeighborsDist_BAG_L1 ... Training model for up to 399.2s of the 599.14s of remaining time.
	-84.1251	 = Validation score   (-root_mean_squared_error)
	0.03s	 = Training   runtime
	0.1s	 = Validation runtime
Fitting model: LightGBMXT_BAG_L1 ... Training model for up to 398.84s of the 598.78s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-35.1225	 = Validation score   (-root_mean_squared_error)
	70.96s	 = Training   runtime
	6.39s	 = Validation runtime
Fitting model: LightGBM_BAG_L1 ... Training model for up to 321.87s of the 521.81s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-33.2233	 = Validation score   (-root_mean_squared_error)
	55.37s	 = Training   runtime
	5.12s	 = Validation runtime
Fitting model: RandomForestMSE_BAG_L1 ... Training model for up to 261.33s of the 461.27s of remaining time.
	-38.6807	 = Validation score   (-root_mean_squared_error)
	13.91s	 = Training   runtime
	0.6s	 = Validation runtime
Fitting model: CatBoost_BAG_L1 ... Training model for up to 244.39s of the 444.33s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-34.7403	 = Validation score   (-root_mean_squared_error)
	209.71s	 = Training   runtime
	0.2s	 = Validation runtime
Fitting model: ExtraTreesMSE_BAG_L1 ... Training model for up to 30.33s of the 230.27s of remaining time.
	-37.9695	 = Validation score   (-root_mean_squared_error)
	6.77s	 = Training   runtime
	0.58s	 = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L1 ... Training model for up to 20.49s of the 220.43s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-78.2766	 = Validation score   (-root_mean_squared_error)
	43.4s	 = Training   runtime
	0.57s	 = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 172.69s of remaining time.
	-32.1908	 = Validation score   (-root_mean_squared_error)
	0.64s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting 9 L2 models ...
Fitting model: LightGBMXT_BAG_L2 ... Training model for up to 171.95s of the 171.93s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-31.4643	 = Validation score   (-root_mean_squared_error)
	29.67s	 = Training   runtime
	0.74s	 = Validation runtime
Fitting model: LightGBM_BAG_L2 ... Training model for up to 137.36s of the 137.34s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-30.5442	 = Validation score   (-root_mean_squared_error)
	28.09s	 = Training   runtime
	0.56s	 = Validation runtime
Fitting model: RandomForestMSE_BAG_L2 ... Training model for up to 104.84s of the 104.82s of remaining time.
	-31.5157	 = Validation score   (-root_mean_squared_error)
	30.52s	 = Training   runtime
	0.66s	 = Validation runtime
Fitting model: CatBoost_BAG_L2 ... Training model for up to 71.39s of the 71.37s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-31.0877	 = Validation score   (-root_mean_squared_error)
	70.74s	 = Training   runtime
	0.15s	 = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L3 ... Training model for up to 360.0s of the -4.99s of remaining time.
	-30.3377	 = Validation score   (-root_mean_squared_error)
	0.42s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 605.62s ... Best model: "WeightedEnsemble_L3"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20221230_045712/")
In [41]:
predictor_new_features.fit_summary()
*** Summary of fit() ***
Estimated performance of each model:
                     model   score_val  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0      WeightedEnsemble_L3  -30.337686      15.769274  559.629112                0.001314           0.419274            3       True         14
1          LightGBM_BAG_L2  -30.544203      14.215940  428.281188                0.555478          28.091259            2       True         11
2          CatBoost_BAG_L2  -31.087709      13.813266  470.927791                0.152805          70.737862            2       True         13
3        LightGBMXT_BAG_L2  -31.464343      14.400542  429.859494                0.740081          29.669564            2       True         10
4   RandomForestMSE_BAG_L2  -31.515716      14.319597  430.711153                0.659136          30.521224            2       True         12
5      WeightedEnsemble_L2  -32.190822      12.409169  350.620753                0.001431           0.644369            2       True          9
6          LightGBM_BAG_L1  -33.223304       5.116870   55.371105                5.116870          55.371105            1       True          4
7          CatBoost_BAG_L1  -34.740254       0.202554  209.710110                0.202554         209.710110            1       True          6
8        LightGBMXT_BAG_L1  -35.122505       6.385618   70.961028                6.385618          70.961028            1       True          3
9     ExtraTreesMSE_BAG_L1  -37.969525       0.583711    6.771391                0.583711           6.771391            1       True          7
10  RandomForestMSE_BAG_L1  -38.680737       0.598819   13.906645                0.598819          13.906645            1       True          5
11  NeuralNetFastAI_BAG_L1  -78.276613       0.565712   43.403863                0.565712          43.403863            1       True          8
12   KNeighborsDist_BAG_L1  -84.125061       0.103877    0.027496                0.103877           0.027496            1       True          2
13   KNeighborsUnif_BAG_L1 -101.546199       0.103299    0.038291                0.103299           0.038291            1       True          1
Number of models trained: 14
Types of models trained:
{'WeightedEnsembleModel', 'StackerEnsembleModel_LGB', 'StackerEnsembleModel_CatBoost', 'StackerEnsembleModel_NNFastAiTabular', 'StackerEnsembleModel_XT', 'StackerEnsembleModel_KNN', 'StackerEnsembleModel_RF'}
Bagging used: True  (with 8 folds)
Multi-layer stack-ensembling used: True  (with 3 levels)
Feature Metadata (Processed):
(raw dtype, special dtypes):
('category', [])             : 6 | ['season', 'weather', 'daytime', 'atempcat', 'windcat', ...]
('float', [])                : 3 | ['temp', 'atemp', 'windspeed']
('int', [])                  : 2 | ['humidity', 'hour']
('int', ['bool'])            : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
Plot summary of models saved to file: AutogluonModels/ag-20221230_045712/SummaryOfModels.html
*** End of fit() summary ***
Out[41]:
{'model_types': {'KNeighborsUnif_BAG_L1': 'StackerEnsembleModel_KNN',
  'KNeighborsDist_BAG_L1': 'StackerEnsembleModel_KNN',
  'LightGBMXT_BAG_L1': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L1': 'StackerEnsembleModel_LGB',
  'RandomForestMSE_BAG_L1': 'StackerEnsembleModel_RF',
  'CatBoost_BAG_L1': 'StackerEnsembleModel_CatBoost',
  'ExtraTreesMSE_BAG_L1': 'StackerEnsembleModel_XT',
  'NeuralNetFastAI_BAG_L1': 'StackerEnsembleModel_NNFastAiTabular',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel',
  'LightGBMXT_BAG_L2': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L2': 'StackerEnsembleModel_LGB',
  'RandomForestMSE_BAG_L2': 'StackerEnsembleModel_RF',
  'CatBoost_BAG_L2': 'StackerEnsembleModel_CatBoost',
  'WeightedEnsemble_L3': 'WeightedEnsembleModel'},
 'model_performance': {'KNeighborsUnif_BAG_L1': -101.54619908446061,
  'KNeighborsDist_BAG_L1': -84.12506123181602,
  'LightGBMXT_BAG_L1': -35.1225045152507,
  'LightGBM_BAG_L1': -33.22330386513564,
  'RandomForestMSE_BAG_L1': -38.68073745703023,
  'CatBoost_BAG_L1': -34.74025415038994,
  'ExtraTreesMSE_BAG_L1': -37.9695247790059,
  'NeuralNetFastAI_BAG_L1': -78.27661254918321,
  'WeightedEnsemble_L2': -32.19082176546677,
  'LightGBMXT_BAG_L2': -31.46434331779796,
  'LightGBM_BAG_L2': -30.54420311983629,
  'RandomForestMSE_BAG_L2': -31.515715795837146,
  'CatBoost_BAG_L2': -31.08770851309625,
  'WeightedEnsemble_L3': -30.337686490843712},
 'model_best': 'WeightedEnsemble_L3',
 'model_paths': {'KNeighborsUnif_BAG_L1': 'AutogluonModels/ag-20221230_045712/models/KNeighborsUnif_BAG_L1/',
  'KNeighborsDist_BAG_L1': 'AutogluonModels/ag-20221230_045712/models/KNeighborsDist_BAG_L1/',
  'LightGBMXT_BAG_L1': 'AutogluonModels/ag-20221230_045712/models/LightGBMXT_BAG_L1/',
  'LightGBM_BAG_L1': 'AutogluonModels/ag-20221230_045712/models/LightGBM_BAG_L1/',
  'RandomForestMSE_BAG_L1': 'AutogluonModels/ag-20221230_045712/models/RandomForestMSE_BAG_L1/',
  'CatBoost_BAG_L1': 'AutogluonModels/ag-20221230_045712/models/CatBoost_BAG_L1/',
  'ExtraTreesMSE_BAG_L1': 'AutogluonModels/ag-20221230_045712/models/ExtraTreesMSE_BAG_L1/',
  'NeuralNetFastAI_BAG_L1': 'AutogluonModels/ag-20221230_045712/models/NeuralNetFastAI_BAG_L1/',
  'WeightedEnsemble_L2': 'AutogluonModels/ag-20221230_045712/models/WeightedEnsemble_L2/',
  'LightGBMXT_BAG_L2': 'AutogluonModels/ag-20221230_045712/models/LightGBMXT_BAG_L2/',
  'LightGBM_BAG_L2': 'AutogluonModels/ag-20221230_045712/models/LightGBM_BAG_L2/',
  'RandomForestMSE_BAG_L2': 'AutogluonModels/ag-20221230_045712/models/RandomForestMSE_BAG_L2/',
  'CatBoost_BAG_L2': 'AutogluonModels/ag-20221230_045712/models/CatBoost_BAG_L2/',
  'WeightedEnsemble_L3': 'AutogluonModels/ag-20221230_045712/models/WeightedEnsemble_L3/'},
 'model_fit_times': {'KNeighborsUnif_BAG_L1': 0.03829073905944824,
  'KNeighborsDist_BAG_L1': 0.0274960994720459,
  'LightGBMXT_BAG_L1': 70.96102786064148,
  'LightGBM_BAG_L1': 55.3711051940918,
  'RandomForestMSE_BAG_L1': 13.906644821166992,
  'CatBoost_BAG_L1': 209.71011018753052,
  'ExtraTreesMSE_BAG_L1': 6.771391153335571,
  'NeuralNetFastAI_BAG_L1': 43.40386343002319,
  'WeightedEnsemble_L2': 0.6443688869476318,
  'LightGBMXT_BAG_L2': 29.669564247131348,
  'LightGBM_BAG_L2': 28.091259002685547,
  'RandomForestMSE_BAG_L2': 30.521223545074463,
  'CatBoost_BAG_L2': 70.73786163330078,
  'WeightedEnsemble_L3': 0.41927385330200195},
 'model_pred_times': {'KNeighborsUnif_BAG_L1': 0.10329937934875488,
  'KNeighborsDist_BAG_L1': 0.10387706756591797,
  'LightGBMXT_BAG_L1': 6.385617971420288,
  'LightGBM_BAG_L1': 5.116869926452637,
  'RandomForestMSE_BAG_L1': 0.5988190174102783,
  'CatBoost_BAG_L1': 0.20255446434020996,
  'ExtraTreesMSE_BAG_L1': 0.5837111473083496,
  'NeuralNetFastAI_BAG_L1': 0.5657124519348145,
  'WeightedEnsemble_L2': 0.0014307498931884766,
  'LightGBMXT_BAG_L2': 0.7400805950164795,
  'LightGBM_BAG_L2': 0.5554780960083008,
  'RandomForestMSE_BAG_L2': 0.6591355800628662,
  'CatBoost_BAG_L2': 0.15280485153198242,
  'WeightedEnsemble_L3': 0.0013136863708496094},
 'num_bag_folds': 8,
 'max_stack_level': 3,
 'model_hyperparams': {'KNeighborsUnif_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'KNeighborsDist_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'LightGBMXT_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'RandomForestMSE_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'CatBoost_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'ExtraTreesMSE_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'NeuralNetFastAI_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'WeightedEnsemble_L2': {'use_orig_features': False,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBMXT_BAG_L2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'RandomForestMSE_BAG_L2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'CatBoost_BAG_L2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'WeightedEnsemble_L3': {'use_orig_features': False,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True}},
 'leaderboard':                      model   score_val  pred_time_val    fit_time  \
 0      WeightedEnsemble_L3  -30.337686      15.769274  559.629112   
 1          LightGBM_BAG_L2  -30.544203      14.215940  428.281188   
 2          CatBoost_BAG_L2  -31.087709      13.813266  470.927791   
 3        LightGBMXT_BAG_L2  -31.464343      14.400542  429.859494   
 4   RandomForestMSE_BAG_L2  -31.515716      14.319597  430.711153   
 5      WeightedEnsemble_L2  -32.190822      12.409169  350.620753   
 6          LightGBM_BAG_L1  -33.223304       5.116870   55.371105   
 7          CatBoost_BAG_L1  -34.740254       0.202554  209.710110   
 8        LightGBMXT_BAG_L1  -35.122505       6.385618   70.961028   
 9     ExtraTreesMSE_BAG_L1  -37.969525       0.583711    6.771391   
 10  RandomForestMSE_BAG_L1  -38.680737       0.598819   13.906645   
 11  NeuralNetFastAI_BAG_L1  -78.276613       0.565712   43.403863   
 12   KNeighborsDist_BAG_L1  -84.125061       0.103877    0.027496   
 13   KNeighborsUnif_BAG_L1 -101.546199       0.103299    0.038291   
 
     pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  \
 0                 0.001314           0.419274            3       True   
 1                 0.555478          28.091259            2       True   
 2                 0.152805          70.737862            2       True   
 3                 0.740081          29.669564            2       True   
 4                 0.659136          30.521224            2       True   
 5                 0.001431           0.644369            2       True   
 6                 5.116870          55.371105            1       True   
 7                 0.202554         209.710110            1       True   
 8                 6.385618          70.961028            1       True   
 9                 0.583711           6.771391            1       True   
 10                0.598819          13.906645            1       True   
 11                0.565712          43.403863            1       True   
 12                0.103877           0.027496            1       True   
 13                0.103299           0.038291            1       True   
 
     fit_order  
 0          14  
 1          11  
 2          13  
 3          10  
 4          12  
 5           9  
 6           4  
 7           6  
 8           3  
 9           7  
 10          5  
 11          8  
 12          2  
 13          1  }
In [42]:
predictor_new_features.leaderboard(silent=True).plot(kind="bar", x="model", y="score_val")
Out[42]:
<AxesSubplot:xlabel='model'>
In [43]:
# Save training/validation scores
leaderboard_nf = predictor_new_features.leaderboard(silent=True)
leaderboard_nf["description"] = "new features added"
leaderboard_nf.to_csv("leaderboard_new_features.csv", index=False)
leaderboard_nf.head()
Out[43]:
model score_val pred_time_val fit_time pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order description
0 WeightedEnsemble_L3 -30.337686 15.769274 559.629112 0.001314 0.419274 3 True 14 new features added
1 LightGBM_BAG_L2 -30.544203 14.215940 428.281188 0.555478 28.091259 2 True 11 new features added
2 CatBoost_BAG_L2 -31.087709 13.813266 470.927791 0.152805 70.737862 2 True 13 new features added
3 LightGBMXT_BAG_L2 -31.464343 14.400542 429.859494 0.740081 29.669564 2 True 10 new features added
4 RandomForestMSE_BAG_L2 -31.515716 14.319597 430.711153 0.659136 30.521224 2 True 12 new features added
In [44]:
predictions_nf = predictor_new_features.predict(test)
predictions_nf.head()
Out[44]:
0    14.414740
1    10.229486
2     9.815841
3     8.203454
4     7.122305
Name: count, dtype: float32
In [45]:
predictions_nf.describe()
Out[45]:
count    6493.000000
mean      161.998306
std       143.446411
min         2.800360
25%        48.185959
50%       126.332138
75%       229.810394
max       816.655762
Name: count, dtype: float64
In [46]:
# Remember to set all negative values to zero
n_negative = (predictions_nf < 0).sum()
predictions_nf = predictions_nf.clip(lower=0)
print(n_negative)
0
In [47]:
submission_new_features = pd.read_csv("sampleSubmission.csv", parse_dates=["datetime"])
submission_new_features.head()
Out[47]:
datetime count
0 2011-01-20 00:00:00 0
1 2011-01-20 01:00:00 0
2 2011-01-20 02:00:00 0
3 2011-01-20 03:00:00 0
4 2011-01-20 04:00:00 0
In [48]:
# Save predictions for submission
submission_new_features["count"] = predictions_nf.round(0).astype(int)
submission_new_features.to_csv("submission_new_features.csv", index=False)
In [49]:
!kaggle competitions submit -c bike-sharing-demand -f submission_new_features.csv -m "model with new features"
100%|█████████████████████████████████████████| 149k/149k [00:00<00:00, 373kB/s]
Successfully submitted to Bike Sharing Demand
In [50]:
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6
fileName                     date                 description                        status    publicScore  privateScore  
---------------------------  -------------------  ---------------------------------  --------  -----------  ------------  
submission_new_features.csv  2022-12-30 05:07:57  model with new features            complete  0.61186      0.61186       
submission.csv               2022-12-30 04:53:09  first raw submission               complete  1.79188      1.79188       
submission_hpo.csv           2022-12-30 04:34:39  new features and hpo               complete  0.62542      0.62542       
submission_new_features.csv  2022-12-30 04:16:09  model with new features            complete  0.60781      0.60781       
tail: write error: Broken pipe

New Score of 0.61186¶

In [51]:
#Score with one additional feature (hour): 0.67642
#Score with more features: 0.61186
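As a quick sanity check (a side calculation, not part of the original notebook), the relative improvement from the extra engineered features follows directly from the two Kaggle scores above:

```python
# Kaggle RMSLE scores recorded above (lower is better)
score_hour_only = 0.67642
score_more_features = 0.61186

# Relative improvement from the extra engineered features
improvement = (score_hour_only - score_more_features) / score_hour_only
print(f"{improvement:.1%}")  # 9.5%
```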

Step 6: Hyperparameter optimization¶

  • There are many options for hyperparameter optimization.
  • One option is to change AutoGluon's higher-level parameters (stacking, bagging); another is to tune the individual models' own hyperparameters.
  • Tuning the models' own hyperparameters requires passing the hyperparameters and hyperparameter_tune_kwargs arguments to fit().
In [52]:
import autogluon.core as ag

#hyperparameters
nn_options = {  # specifies non-default hyperparameter values for neural network models
    'num_epochs': 10,  # number of training epochs (controls training time of NN models)
    'learning_rate': ag.space.Real(1e-4, 1e-2, default=5e-4, log=True),  # learning rate used in training (real-valued hyperparameter searched on log-scale)
    'activation': ag.space.Categorical('relu', 'softrelu', 'tanh'),  # activation function used in NN (categorical hyperparameter, default = first entry)
    'dropout_prob': ag.space.Real(0.0, 0.5, default=0.1),  # dropout probability (real-valued hyperparameter)
}

gbm_options = {  # specifies non-default hyperparameter values for lightGBM gradient boosted trees
    'num_boost_round': 100,  # number of boosting rounds (controls training time of GBM models)
    'num_leaves': ag.space.Int(lower=26, upper=66, default=36),  # number of leaves in trees (integer hyperparameter)
}

hyperparameters = {  # hyperparameters of each model type
                   'GBM': gbm_options,
                   'NN_TORCH': nn_options,  # NOTE: comment this line out if you get errors on Mac OSX
                  }  # When these keys are missing from hyperparameters dict, no models of that type are trained


#hyperparameter_tune_kwargs
num_trials = 5  # try at most 5 different hyperparameter configurations for each type of model
search_strategy = 'auto'  # to tune hyperparameters using random search routine with a local scheduler

hyperparameter_tune_kwargs = {  # HPO is not performed unless hyperparameter_tune_kwargs is specified
    'num_trials': num_trials,
    'scheduler' : 'local',
    'searcher': search_strategy,
}

learner_kwargs = {
    "ignored_columns": ["casual", "registered"]
}

predictor_new_hpo = TabularPredictor(label="count", learner_kwargs=learner_kwargs, problem_type="regression", 
    eval_metric="root_mean_squared_error").fit(
        train_data=train, 
        time_limit=600,  
        num_stack_levels=3, 
        num_bag_folds=10, 
        num_bag_sets=20,
        hyperparameters=hyperparameters,
        hyperparameter_tune_kwargs=hyperparameter_tune_kwargs)
No model was trained during hyperparameter tuning NeuralNetTorch_BAG_L4... Skipping this model.
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L5 ... Training model for up to 360.0s of the 102.67s of remaining time.
	-38.9944	 = Validation score   (-root_mean_squared_error)
	0.42s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 497.97s ... Best model: "WeightedEnsemble_L3"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20221230_053934/")
In [53]:
predictor_new_hpo.fit_summary()
*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L3 -37.234105       0.001797  126.020105                0.001361           0.367549            3       True          7
1    LightGBM_BAG_L2/T3 -37.349089       0.000348   94.059461                0.000141          32.468658            2       True          6
2    LightGBM_BAG_L2/T2 -37.561850       0.000296   93.183897                0.000088          31.593095            2       True          5
3    LightGBM_BAG_L2/T1 -37.659894       0.000299   93.165067                0.000091          31.574265            2       True          4
4   WeightedEnsemble_L4 -37.847635       0.002242  252.986472                0.001394           0.589363            4       True         11
5    LightGBM_BAG_L3/T2 -37.933480       0.000627  188.578799                0.000100          31.351979            3       True          9
6    LightGBM_BAG_L3/T1 -37.998869       0.000615  189.135905                0.000088          31.909084            3       True          8
7    LightGBM_BAG_L3/T3 -38.185753       0.000661  189.136046                0.000133          31.909225            3       True         10
8    LightGBM_BAG_L1/T2 -38.539657       0.000103   30.861836                0.000103          30.861836            1       True          2
9   WeightedEnsemble_L2 -38.539657       0.003394   31.164896                0.003291           0.303059            2       True          3
10  WeightedEnsemble_L5 -38.994413       0.002591  349.357601                0.001472           0.418964            5       True         15
11   LightGBM_BAG_L4/T1 -39.101755       0.000936  284.917474                0.000088          32.520365            4       True         12
12   LightGBM_BAG_L4/T2 -39.170227       0.000946  283.458624                0.000098          31.061514            4       True         13
13   LightGBM_BAG_L4/T3 -39.225552       0.000934  285.356759                0.000085          32.959650            4       True         14
14   LightGBM_BAG_L1/T1 -40.206175       0.000104   30.728966                0.000104          30.728966            1       True          1
Number of models trained: 15
Types of models trained:
{'WeightedEnsembleModel', 'StackerEnsembleModel_LGB'}
Bagging used: True  (with 10 folds)
Multi-layer stack-ensembling used: True  (with 5 levels)
Feature Metadata (Processed):
(raw dtype, special dtypes):
('category', [])             : 6 | ['season', 'weather', 'daytime', 'atempcat', 'windcat', ...]
('float', [])                : 3 | ['temp', 'atemp', 'windspeed']
('int', [])                  : 2 | ['humidity', 'hour']
('int', ['bool'])            : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
Plot summary of models saved to file: AutogluonModels/ag-20221230_053934/SummaryOfModels.html
*** End of fit() summary ***
Out[53]:
{'model_types': {'LightGBM_BAG_L1/T1': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L1/T2': 'StackerEnsembleModel_LGB',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel',
  'LightGBM_BAG_L2/T1': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L2/T2': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L2/T3': 'StackerEnsembleModel_LGB',
  'WeightedEnsemble_L3': 'WeightedEnsembleModel',
  'LightGBM_BAG_L3/T1': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L3/T2': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L3/T3': 'StackerEnsembleModel_LGB',
  'WeightedEnsemble_L4': 'WeightedEnsembleModel',
  'LightGBM_BAG_L4/T1': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L4/T2': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L4/T3': 'StackerEnsembleModel_LGB',
  'WeightedEnsemble_L5': 'WeightedEnsembleModel'},
 'model_performance': {'LightGBM_BAG_L1/T1': -40.2061752974064,
  'LightGBM_BAG_L1/T2': -38.53965666990459,
  'WeightedEnsemble_L2': -38.53965666990459,
  'LightGBM_BAG_L2/T1': -37.659894371651326,
  'LightGBM_BAG_L2/T2': -37.561849723350264,
  'LightGBM_BAG_L2/T3': -37.34908938630387,
  'WeightedEnsemble_L3': -37.23410485308833,
  'LightGBM_BAG_L3/T1': -37.99886917604992,
  'LightGBM_BAG_L3/T2': -37.933479660418755,
  'LightGBM_BAG_L3/T3': -38.18575313187197,
  'WeightedEnsemble_L4': -37.84763528055502,
  'LightGBM_BAG_L4/T1': -39.10175531610709,
  'LightGBM_BAG_L4/T2': -39.17022731113232,
  'LightGBM_BAG_L4/T3': -39.225551602020325,
  'WeightedEnsemble_L5': -38.99441314595425},
 'model_best': 'WeightedEnsemble_L3',
 'model_paths': {'LightGBM_BAG_L1/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L1/T1/',
  'LightGBM_BAG_L1/T2': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L1/T2/',
  'WeightedEnsemble_L2': 'AutogluonModels/ag-20221230_053934/models/WeightedEnsemble_L2/',
  'LightGBM_BAG_L2/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L2/T1/',
  'LightGBM_BAG_L2/T2': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L2/T2/',
  'LightGBM_BAG_L2/T3': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L2/T3/',
  'WeightedEnsemble_L3': 'AutogluonModels/ag-20221230_053934/models/WeightedEnsemble_L3/',
  'LightGBM_BAG_L3/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L3/T1/',
  'LightGBM_BAG_L3/T2': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L3/T2/',
  'LightGBM_BAG_L3/T3': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L3/T3/',
  'WeightedEnsemble_L4': 'AutogluonModels/ag-20221230_053934/models/WeightedEnsemble_L4/',
  'LightGBM_BAG_L4/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L4/T1/',
  'LightGBM_BAG_L4/T2': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L4/T2/',
  'LightGBM_BAG_L4/T3': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L4/T3/',
  'WeightedEnsemble_L5': 'AutogluonModels/ag-20221230_053934/models/WeightedEnsemble_L5/'},
 'model_fit_times': {'LightGBM_BAG_L1/T1': 30.728965759277344,
  'LightGBM_BAG_L1/T2': 30.861836433410645,
  'WeightedEnsemble_L2': 0.30305910110473633,
  'LightGBM_BAG_L2/T1': 31.574265003204346,
  'LightGBM_BAG_L2/T2': 31.59309482574463,
  'LightGBM_BAG_L2/T3': 32.468658447265625,
  'WeightedEnsemble_L3': 0.36754918098449707,
  'LightGBM_BAG_L3/T1': 31.90908432006836,
  'LightGBM_BAG_L3/T2': 31.35197901725769,
  'LightGBM_BAG_L3/T3': 31.909225463867188,
  'WeightedEnsemble_L4': 0.5893630981445312,
  'LightGBM_BAG_L4/T1': 32.52036452293396,
  'LightGBM_BAG_L4/T2': 31.061514377593994,
  'LightGBM_BAG_L4/T3': 32.95964956283569,
  'WeightedEnsemble_L5': 0.4189636707305908},
 'model_pred_times': {'LightGBM_BAG_L1/T1': 0.00010442733764648438,
  'LightGBM_BAG_L1/T2': 0.00010323524475097656,
  'WeightedEnsemble_L2': 0.0032906532287597656,
  'LightGBM_BAG_L2/T1': 9.107589721679688e-05,
  'LightGBM_BAG_L2/T2': 8.797645568847656e-05,
  'LightGBM_BAG_L2/T3': 0.00014066696166992188,
  'WeightedEnsemble_L3': 0.0013608932495117188,
  'LightGBM_BAG_L3/T1': 8.797645568847656e-05,
  'LightGBM_BAG_L3/T2': 9.989738464355469e-05,
  'LightGBM_BAG_L3/T3': 0.00013327598571777344,
  'WeightedEnsemble_L4': 0.0013937950134277344,
  'LightGBM_BAG_L4/T1': 8.7738037109375e-05,
  'LightGBM_BAG_L4/T2': 9.751319885253906e-05,
  'LightGBM_BAG_L4/T3': 8.535385131835938e-05,
  'WeightedEnsemble_L5': 0.0014719963073730469},
 'num_bag_folds': 10,
 'max_stack_level': 5,
 'model_hyperparams': {'LightGBM_BAG_L1/T1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L1/T2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'WeightedEnsemble_L2': {'use_orig_features': False,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L2/T1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L2/T2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L2/T3': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'WeightedEnsemble_L3': {'use_orig_features': False,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L3/T1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L3/T2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L3/T3': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'WeightedEnsemble_L4': {'use_orig_features': False,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L4/T1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L4/T2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L4/T3': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'WeightedEnsemble_L5': {'use_orig_features': False,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True}},
 'leaderboard':                   model  score_val  pred_time_val    fit_time  \
 0   WeightedEnsemble_L3 -37.234105       0.001797  126.020105   
 1    LightGBM_BAG_L2/T3 -37.349089       0.000348   94.059461   
 2    LightGBM_BAG_L2/T2 -37.561850       0.000296   93.183897   
 3    LightGBM_BAG_L2/T1 -37.659894       0.000299   93.165067   
 4   WeightedEnsemble_L4 -37.847635       0.002242  252.986472   
 5    LightGBM_BAG_L3/T2 -37.933480       0.000627  188.578799   
 6    LightGBM_BAG_L3/T1 -37.998869       0.000615  189.135905   
 7    LightGBM_BAG_L3/T3 -38.185753       0.000661  189.136046   
 8    LightGBM_BAG_L1/T2 -38.539657       0.000103   30.861836   
 9   WeightedEnsemble_L2 -38.539657       0.003394   31.164896   
 10  WeightedEnsemble_L5 -38.994413       0.002591  349.357601   
 11   LightGBM_BAG_L4/T1 -39.101755       0.000936  284.917474   
 12   LightGBM_BAG_L4/T2 -39.170227       0.000946  283.458624   
 13   LightGBM_BAG_L4/T3 -39.225552       0.000934  285.356759   
 14   LightGBM_BAG_L1/T1 -40.206175       0.000104   30.728966   
 
     pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  \
 0                 0.001361           0.367549            3       True   
 1                 0.000141          32.468658            2       True   
 2                 0.000088          31.593095            2       True   
 3                 0.000091          31.574265            2       True   
 4                 0.001394           0.589363            4       True   
 5                 0.000100          31.351979            3       True   
 6                 0.000088          31.909084            3       True   
 7                 0.000133          31.909225            3       True   
 8                 0.000103          30.861836            1       True   
 9                 0.003291           0.303059            2       True   
 10                0.001472           0.418964            5       True   
 11                0.000088          32.520365            4       True   
 12                0.000098          31.061514            4       True   
 13                0.000085          32.959650            4       True   
 14                0.000104          30.728966            1       True   
 
     fit_order  
 0           7  
 1           6  
 2           5  
 3           4  
 4          11  
 5           9  
 6           8  
 7          10  
 8           2  
 9           3  
 10         15  
 11         12  
 12         13  
 13         14  
 14          1  }
In [54]:
predictor_new_hpo.leaderboard(silent=True).plot(kind="bar", x="model", y="score_val")
Out[54]:
<AxesSubplot:xlabel='model'>
In [55]:
leaderboard_hpo = predictor_new_hpo.leaderboard(silent=True)
leaderboard_hpo["description"] = "hpo"
leaderboard_hpo.to_csv("leaderboard_hpo.csv", index=False)
leaderboard_hpo.head()
Out[55]:
model score_val pred_time_val fit_time pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order description
0 WeightedEnsemble_L3 -37.234105 0.001797 126.020105 0.001361 0.367549 3 True 7 hpo
1 LightGBM_BAG_L2/T3 -37.349089 0.000348 94.059461 0.000141 32.468658 2 True 6 hpo
2 LightGBM_BAG_L2/T2 -37.561850 0.000296 93.183897 0.000088 31.593095 2 True 5 hpo
3 LightGBM_BAG_L2/T1 -37.659894 0.000299 93.165067 0.000091 31.574265 2 True 4 hpo
4 WeightedEnsemble_L4 -37.847635 0.002242 252.986472 0.001394 0.589363 4 True 11 hpo
In [56]:
predictions_hpo = predictor_new_hpo.predict(test)
predictions_hpo.head()
Out[56]:
0    9.378200
1    6.759612
2    6.675080
3    6.639610
4    6.622023
Name: count, dtype: float32
In [57]:
predictions_hpo.describe()
Out[57]:
count    6493.000000
mean      191.546890
std       173.236862
min         5.678326
25%        48.135056
50%       148.766876
75%       287.000244
max       869.541260
Name: count, dtype: float64
In [58]:
# Remember to set all negative values to zero
# Note: reassigning the loop variable (i = 0) would not modify the Series,
# so count the negatives and zero them out with a boolean mask instead
x = int((predictions_hpo < 0).sum())
predictions_hpo[predictions_hpo < 0] = 0
print(x)
0
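An equivalent vectorized approach uses `Series.clip`, which avoids the explicit negative-value check entirely. A sketch with a toy Series standing in for the real predictions (which come from `predictor.predict(test)`):

```python
import pandas as pd

# Toy stand-in for predictions_hpo; values are illustrative only
preds = pd.Series([9.37, -1.2, 6.67, -0.5, 6.62], name="count")

n_negative = int((preds < 0).sum())  # count negatives before clipping
preds = preds.clip(lower=0)          # set all negative values to zero

print(n_negative)        # 2 values were clipped in this toy example
print(preds.min() >= 0)  # True: no negatives remain
```

Kaggle rejects submissions with negative counts, so either the mask assignment or `clip(lower=0)` works; `clip` is the more idiomatic one-liner.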
In [59]:
submission_hpo = pd.read_csv("sampleSubmission.csv", parse_dates=["datetime"])
submission_hpo.head()
Out[59]:
datetime count
0 2011-01-20 00:00:00 0
1 2011-01-20 01:00:00 0
2 2011-01-20 02:00:00 0
3 2011-01-20 03:00:00 0
4 2011-01-20 04:00:00 0
In [60]:
# Same as before: write the predictions into the sample submission and save
submission_hpo["count"] = predictions_hpo.round(0).astype(int)
submission_hpo.to_csv("submission_hpo.csv", index=False)
In [61]:
!kaggle competitions submit -c bike-sharing-demand -f submission_hpo.csv -m "new features and hpo"
100%|█████████████████████████████████████████| 149k/149k [00:00<00:00, 312kB/s]
Successfully submitted to Bike Sharing Demand
In [62]:
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6
fileName                     date                 description                        status    publicScore  privateScore  
---------------------------  -------------------  ---------------------------------  --------  -----------  ------------  
submission_hpo.csv           2022-12-30 05:47:57  new features and hpo               complete  0.48165      0.48165       
submission_new_features.csv  2022-12-30 05:07:57  model with new features            complete  0.61186      0.61186       
submission.csv               2022-12-30 04:53:09  first raw submission               complete  1.79188      1.79188       
submission_hpo.csv           2022-12-30 04:34:39  new features and hpo               complete  0.62542      0.62542       
tail: write error: Broken pipe

New Score of 0.48165¶

In [63]:
# Score (tuned hpo):   0.48165
# Score (default hpo): 0.62542

Step 7: Write a Report¶

Refer to the markdown file for the full report¶

Creating plots and table for report¶

In [64]:
# Taking the top model score from each training run and creating a line plot to show improvement
# You can create these in the notebook and save them to PNG or use some other tool (e.g. google sheets, excel)
fig = pd.DataFrame(
    {
        "model": ["initial", "add_features", "hpo"],
        "score": [53.114374, 30.337686, 37.234105]
    }
).plot(x="model", y="score", figsize=(8, 6)).get_figure()
fig.savefig('model_train_score.png')
In [65]:
# Taking the 3 kaggle scores and creating a line plot to show improvement
fig = pd.DataFrame(
    {
        "test_eval": ["initial", "add_features", "hpo"],
        "score": [1.79188, 0.61186, 0.48165]
    }
).plot(x="test_eval", y="score", figsize=(8, 6)).get_figure()
fig.savefig('model_test_score.png')

Hyperparameter table¶

In [66]:
# The 3 hyperparameters we tuned, with the public kaggle score of each run as the result
pd.DataFrame({
    "model": ["initial", "add_features", "hpo"],
    "num_stack_levels": [1, 1, 3],
    "num_bag_folds": [8, 8, 10],
    "num_bag_sets": [20, 20, 20],
    "score": [1.79188, 0.61186, 0.48165]
})
Out[66]:
model num_stack_levels num_bag_folds num_bag_sets score
0 initial 1 8 20 1.79188
1 add_features 1 8 20 0.61186
2 hpo 3 10 20 0.48165
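For the writeup, the relative improvement between runs can be computed directly. A quick sketch using the public scores from the `kaggle competitions submissions` listing above (RMSLE is lower-is-better, so improvement is the fractional drop in score):

```python
# Public leaderboard scores from the submissions listing
scores = {"initial": 1.79188, "add_features": 0.61186, "hpo": 0.48165}

def improvement(old, new):
    """Fractional reduction of a lower-is-better score."""
    return (old - new) / old

print(f"initial -> add_features: {improvement(scores['initial'], scores['add_features']):.1%}")
print(f"add_features -> hpo:     {improvement(scores['add_features'], scores['hpo']):.1%}")
print(f"initial -> hpo:          {improvement(scores['initial'], scores['hpo']):.1%}")
```

Feature engineering accounts for most of the gain (roughly two thirds of the initial error removed), with hyperparameter tuning contributing a further reduction on top of that.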

References¶

  • How to use AutoGluon for Kaggle competitions. AutoGluon. https://auto.gluon.ai/stable/tutorials/tabular_prediction/tabular-kaggle.html
  • AutoGluon Predictors. AutoGluon. https://auto.gluon.ai/stable/api/autogluon.predictor.html#autogluon.tabular.TabularPredictor.fit
  • Predicting Columns in a Table - In Depth. AutoGluon. https://auto.gluon.ai/0.0.15/tutorials/tabular_prediction/tabular-indepth.html
  • Time-related feature engineering. Scikit-learn. https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html
  • Feature Engineering Examples: Binning Numerical Features. Towards Data Science. https://towardsdatascience.com/feature-engineering-examples-binning-numerical-features-7627149093d